In this part we first analyse the shortcomings of the previous approaches and their possible causes, and then experiment with a few other classifiers, namely random forests, linear support vector machines, and RBF/Gaussian support vector machines. For each classifier, we aim to find hyper-parameters through which the model can address the shortcomings of the previous methods.
Our current best classifier, XGBoost, has the following classification report for training:
Here we can clearly see that our highly tuned model rarely picks up the minority label == 1 (goals), giving low recall, but is accurate on the ones it does pick up (high precision). One major contributor to this challenge is the imbalance of labels in our training data.
We observe that our training dataset is hugely imbalanced, with a shots-to-goals ratio of approximately 10:1. The effects of this imbalance appear in both the training and testing (validation) phases, where the recall for label == 1 (goal) is extremely low even though we get a decent overall accuracy of >90% for almost all our classifiers. Our hypothesis is that the classifier is just "getting by" on learning the representation of shots (label == 0) while goals are neglected.
From this we establish that the overall accuracy of the model is not a good metric for comparing the models/results. From here on, we therefore pay attention to the precision, recall, and F1 score of each label.
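The point above can be illustrated with a small sketch (the data here is synthetic, generated to mimic our roughly 10:1 imbalance; it is not our actual shots/goals dataset): a trivial classifier that always predicts the majority class already reaches ~90% accuracy while having zero recall for label 1.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for our ~10:1 shots-to-goals imbalance
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# A classifier that always predicts the majority class (label 0, "shot")
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

# Accuracy looks strong, but recall for label 1 ("goal") is zero
print(f"accuracy: {accuracy_score(y, pred):.2f}")
print(classification_report(y, pred, zero_division=0))
```

This is why we report per-label precision/recall/F1 rather than accuracy alone.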
We see that with linear SVC we are unable to predict any goals (label == 1) and thus get a recall and F1 score of zero. To overcome this, we change the kernel and experiment with the Gaussian/RBF kernel in the next section.
Since we have far more samples than features, it is safe to experiment with the RBF kernel. With the help of a non-linear kernel, we see some improvement in precision and recall, and thus in the F1 score as well, for label 1. We also experiment with the regularization parameter C and the kernel-spread parameter gamma, but the variations in the testing/validation metrics are minimal.
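The C/gamma sweep can be sketched as follows. The data and the grid values are illustrative assumptions, not our actual dataset or final grid; the structure (standardize, fit an RBF `SVC`, compare validation F1 for label 1) matches the experiment described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# SVMs are sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_tr)
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

# Sweep C (regularization) and gamma (kernel spread); the grid is illustrative
for C in (0.1, 1.0, 10.0):
    for gamma in ("scale", 0.1):
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        f1 = f1_score(y_va, clf.predict(X_va), zero_division=0)
        print(f"C={C:<5} gamma={gamma!s:<6} f1(label 1)={f1:.3f}")
```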
Figure: change in the regularization parameter C.
Seeing improvement in the F1 score with a non-linear kernel, we extend our experiments to another non-linear classifier: random forests. We first experiment with the max_depth parameter of the classifier. We observe that as max_depth increases, the model starts overfitting and accuracy on the validation set drops. Hence, we set max_depth to its minimum value of 2 and start tuning the n_estimators parameter to help the classifier generalize.
We observe that although the roc_auc_scores for training and validation are similar, increasing the number of estimators yields only a minimal increase in the roc_auc_score on the validation set.
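A minimal sketch of this n_estimators sweep, again on synthetic stand-in data with an illustrative grid (the actual dataset and grid are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# max_depth=2 as chosen above; the estimator grid is illustrative
for n in (50, 200, 400):
    rf = RandomForestClassifier(n_estimators=n, max_depth=2,
                                random_state=0).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
    auc_va = roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1])
    print(f"n_estimators={n:<4} train AUC={auc_tr:.3f} val AUC={auc_va:.3f}")
```

Comparing the train and validation AUC side by side at each grid point is what lets us check for overfitting as well as for diminishing returns.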
Unlike the previous scenarios, where we got either zero or minimal recall for label 1, setting the parameter class_weight = 'balanced', i.e. weighting the classes by the inverse of their frequency, enables the model to identify the minority label, and we now see a massive increase in recall to 0.73 for label 1. However, the correctness of these identifications has room for improvement, as the precision is just 0.15, and the overall accuracy of the model also drops from ~90% to ~60%. As previously discussed, we believe that overall accuracy is not a strong metric for judging the learning capacity of the classifier: with an imbalanced dataset like ours, a classifier can achieve very high accuracy by always predicting the majority class, as we saw in the previous cases.
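The class reweighting is a single-parameter change; a sketch on synthetic stand-in data (the reported 0.73/0.15 figures come from our real data and will not be reproduced by this toy example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" weights each class by the inverse of its frequency,
# so errors on the rare "goal" class cost roughly 10x as much as on "shot"
rf = RandomForestClassifier(n_estimators=400, max_depth=2,
                            class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_va)
print(f"recall(label 1)    = {recall_score(y_va, pred):.2f}")
print(f"precision(label 1) = {precision_score(y_va, pred, zero_division=0):.2f}")
```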
From here on, we focus on the precision and recall for labels {0, 1}.
After selecting n_estimators = 400, we experiment with the model parameter max_depth in an effort to improve the precision of our model.
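This last sweep can be sketched as below, with n_estimators fixed at 400 and an illustrative max_depth grid on synthetic stand-in data; the trade-off to watch is whether the gain in precision costs too much recall for label 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# n_estimators fixed at 400 per the selection above; the depth grid is illustrative
for depth in (2, 4, 8):
    rf = RandomForestClassifier(n_estimators=400, max_depth=depth,
                                class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)
    pred = rf.predict(X_va)
    p = precision_score(y_va, pred, zero_division=0)
    r = recall_score(y_va, pred, zero_division=0)
    print(f"max_depth={depth} precision(label 1)={p:.2f} recall(label 1)={r:.2f}")
```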